Note to my fellow peer reviewer: this notebook contains the entire assignment, not only part 1 or part 2 as in many submissions. You are reading an HTML version of my notebook. The live version is available as a Jupyter Notebook (.ipynb) on my GitHub.
In this assignment I decided to explore the city of São Paulo, Brazil, where I was born.
I chose it not only because I was born there but mainly because I wanted to know what kind of data is available for a city outside the major countries and international cities.
Another reason to use São Paulo as a case study is a recent discussion I had with a colleague about a very distinctive type of business found in Brazil, the "Padaria", which loosely translates as "bakery", but not quite.
Padarias in Brazil are quite different from bakeries in North America and Europe.
It may be surprising that the main product of a Padaria is not bread, as you might expect.
Of course you can find bread in a Brazilian Padaria, but not the variety you see in other bakeries.
Simply put, bread is not their focus. They do offer a few basic types of cake and bread, baked several times during the day, because freshly baked bread is much appreciated in Brazil and brings in customers, who then buy complementary products, as in a delicatessen in the USA or Europe: ham, salami (pepperoni), sliced cheeses, juices (mostly orange), filtered coffee (there is espresso too, but filtered coffee is preferred), yogurt and milk (in Brazil, generally speaking, you buy your dairy products daily at the Padaria, not at the supermarket).
Another distinctive aspect is that Padarias are where people have breakfast, rather than at home as is the norm in the USA and Europe. Padarias also act as low-budget restaurants, serving a buffet with a complete meal at midday.
Because Padarias are so important to the Brazilian lifestyle, there are many of them all over the city, and a question arises: where is the best place to open a Padaria in the Zona Leste of São Paulo?
Let's see what we can find out.
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
import requests # library to handle requests
from bs4 import BeautifulSoup # library to scrape web pages
The venue, category and geographic coordinate data come from Foursquare. We query the Foursquare API using its 'near' parameter, which returns results around a named place instead of around a pair of coordinates. The borough and neighborhood names come from a well-regarded news site, globo.com.
Using the names from globo.com, we get the venue details around each neighborhood of each borough. This detailed data is used in our analysis and mapping to answer the question: "Where is the best place to open a Padaria in the Zona Leste of São Paulo?"
I tried to find a database with the geographic coordinates of boroughs and neighborhoods in Brazil to use as the entry point into Foursquare, as we did before.
But I couldn't find a publicly available one on the internet. The Brazilian government site that should have this data does not publish it for general use; you have to file a request explaining why you want the information. I filled in the form but have not received an answer so far.
What I was able to find was a list of boroughs and their respective neighborhoods (the city of São Paulo has 5 boroughs, each called a 'Zona' and named after a direction: 'Norte, Sul, Leste, Oeste, Centro', that is North, South, East, West, Center).
I verified that the Foursquare API accepts another parameter, 'near', to which I can pass a place name and receive the data around that place. Nice! Let's use 'near' with place names instead of geographic coordinates.
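As a rough sketch of what a 'near' request looks like, the snippet below only builds the URL and does not call the API. The helper name build_near_url, the sample place "Vila Matilde" and the MY_ID / MY_SECRET credentials are placeholders made up for illustration; the endpoint, version and radius are the ones used later in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint used later in this notebook; everything else here is illustrative.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_near_url(place, client_id, client_secret, version, radius_m):
    """Build an explore URL that locates results by place name ('near')
    instead of by latitude/longitude."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "near": place,
        "radius": radius_m,
    }
    # urlencode percent-encodes spaces, commas and accented characters
    return EXPLORE_URL + "?" + urlencode(params)

# Placeholder credentials and a sample neighborhood name
url = build_near_url("Vila Matilde,São Paulo,SP", "MY_ID", "MY_SECRET", "20180604", 3000)
print(url)

# Decoding the query string gives back the original place name
print(parse_qs(urlsplit(url).query)["near"][0])
```

Encoding the parameters in one place avoids hand-escaping each name before it is put into the URL.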
The list of boroughs and neighborhoods came from a Brazilian news site, and I will use the BeautifulSoup module to scrape its page.
url= 'http://g1.globo.com/sao-paulo/noticia/2013/11/veja-distribuicao-oficial-dos-bairros-nas-cinco-regioes-da-cidade.html'
response = requests.get(url)
if response.status_code!=200:
    print (f"\npage {url} returned error status {response.status_code}")
    print ("Aborting.")
    quit ()
else:
    print ("page loaded")
html= BeautifulSoup(response.text, 'html.parser')
# create the dataframe to host the data
boroghs_sp = pd.DataFrame(columns=('Borough', 'Neighborhood'))
i= 0 # index used to populate the dataframe rows
# use BeautifulSoup to parse the HTML page
html= BeautifulSoup(response.text, 'html.parser')
# find the div id specified in HTML, and the inner div
# //*[@id="materia-letra"]
div= html.find ('div', id='materia-letra').find ('div')
blocks= div.find_all ('p')
# block[0] has an introductory text
# block[i] has the borough name
# block[i+1] has a neighborhood list
# iterate over all blocks starting at block[1]
borogh= True # starting with a borough
for blk in blocks[1:]:
    if borogh:
        # borough name
        b= blk.text
        # set next element to be processed as a neighborhood list
        borogh= not borogh
        continue
    else:
        # neighborhood list
        nbhd_list= blk.text.replace ('\t', '').split ('\n')
        # set next element to be processed as a borough
        borogh= not borogh
        # populate our dataframe, ignore empty strings
        for nbhd in list (filter (None, nbhd_list)):
            boroghs_sp.loc[i]= [b.strip (), nbhd.strip ()]
            i= i+1
boroghs_sp
Above you can see a few neighborhoods in the central area of São Paulo.
Let's check the shape of our dataframe.
boroghs_sp.shape
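The alternating "borough, then neighborhood list" layout parsed above can also be handled by pairing even and odd blocks. This is a minimal sketch on plain strings, with no HTML involved; the sample list is made up to stand in for the text of the page's paragraph blocks after the introduction, and pair_blocks is a hypothetical helper name.

```python
def pair_blocks(blocks):
    """Turn [borough, neighborhoods, borough, neighborhoods, ...] into
    a list of (borough, neighborhood) rows."""
    rows = []
    # blocks[0::2] are borough names, blocks[1::2] the matching lists
    for borough, nbhd_text in zip(blocks[0::2], blocks[1::2]):
        for nbhd in nbhd_text.replace("\t", "").split("\n"):
            if nbhd.strip():  # ignore empty strings
                rows.append((borough.strip(), nbhd.strip()))
    return rows

# Made-up sample text, shaped like the scraped blocks
sample = ["Centro", "Bela Vista\n\tBom Retiro\n", "Leste", "Itaquera\nMooca\n"]
rows = pair_blocks(sample)
for row in rows:
    print(row)
```

Pairing with zip removes the need for a flag that flips on every iteration, at the cost of assuming the blocks always come in complete pairs.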
Now that we have the names, I adjusted the Foursquare URL to use the 'near' parameter. I changed the number of results to retrieve and extended the radius around each place, as 500 m looked too small.
import os
# initialize my Foursquare credentials
# They are read from environment variables so the secrets are not published with the notebook.
# The variable names below are my own choice; set them before running this cell.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # my Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # my Foursquare Secret
ACCESS_TOKEN = os.environ.get('FOURSQUARE_ACCESS_TOKEN') # my Foursquare Access Token
VERSION = '20180604'
LIMIT = 500
print ("Foursquare credentials are set.")
def getNearbyVenues(df):
    # radius (in meters) around the place passed in 'near'
    radius=3000
    # define the base Foursquare API URL using the near keyword
    base_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&near={}&radius={}'
    venues_list=[]
    # iterate over the input dataframe getting the borough and neighborhood names
    for _, r in df.iterrows ():
        borogh= r['Borough']
        input_nbhd= r['Neighborhood']
        # build the place name and escape spaces and accented characters
        near= requests.utils.quote(input_nbhd + ",São Paulo,SP")
        # create the request URL using the 'near' parameter
        url= base_url.format(CLIENT_ID, CLIENT_SECRET, VERSION, near, radius)
        # GET request
        result = requests.get(url).json()
        # check the result, skip this neighborhood if the return code is not 200
        if result['meta']['code']!=200:
            print ("{}: Request returned error code {}, error type {}".format (input_nbhd, result['meta']['code'], result['meta']['errorType']))
            continue
        # iterate over Foursquare data, get the venue name and category
        items= result["response"]['groups'][0]['items']
        for item in items:
            v= item['venue']
            loc= v['location']
            lat= loc['lat']
            lng= loc['lng']
            # optional fields default to 'NA' when Foursquare omits them
            name= v.get('name', 'NA')
            addr= loc.get('address', 'NA')
            pc= loc.get('postalCode', 'NA')
            # fall back to the input neighborhood when the venue has none
            nbhd= loc.get('neighborhood', input_nbhd)
            try:
                categ= v['categories'][0]['name']
            except (KeyError, IndexError):
                # no 'categories' key, or an empty category list
                categ= 'NA'
            venue= (name, addr, pc, borogh, nbhd, categ, lat, lng)
            venues_list.append(venue)
    # return venues_list as a dataframe
    nearby= pd.DataFrame(data= venues_list, columns= ['Venue', 'Address', 'PostalCode', 'Borogh', 'Neighborhood', 'Category', 'V_Lat', 'V_Long'])
    return(nearby)
print ("This may take a while...be patient")
venues_SP= getNearbyVenues(boroghs_sp)
print ("...done.")
It looks like Foursquare did not recognize some of the names, but that is fine.
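To make the "skip on error" step explicit, here is a small sketch that separates a successful response from a failed one. The two sample dictionaries are hand-written and contain only the fields this notebook reads (meta.code, meta.errorType, response.groups[0].items); the error type value and the helper name extract_items are illustrative assumptions, not taken from a real API call.

```python
def extract_items(result):
    """Return (items, error) from a decoded explore response.
    error is None on success and a short string otherwise."""
    meta = result.get("meta", {})
    if meta.get("code") != 200:
        return [], meta.get("errorType", "unknown")
    groups = result.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    return items, None

# Hand-written samples shaped like the fields read in getNearbyVenues
ok = {"meta": {"code": 200},
      "response": {"groups": [{"items": [{"venue": {"name": "Padaria X"}}]}]}}
bad = {"meta": {"code": 400, "errorType": "failed_geocode"}}

print(extract_items(ok))
print(extract_items(bad))
```

Returning the error alongside the items would also make it easy to collect the unrecognized names in a list instead of only printing them.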
venues_SP.head()
venues_SP.shape
venues_SP['Neighborhood'].unique ()
venues_SP['Category'].unique ().shape
venues_SP.groupby('Borogh')['Venue'].count().sort_values(ascending=False)
venues_East= venues_SP[venues_SP['Borogh']=='Leste']
venues_East.groupby('Category')['Neighborhood'].count ().sort_values(ascending=False)
Even though some requests failed because of unrecognized names, we have plenty of data to work with:
Our dataframe has:
Let's put the venues returned by Foursquare on a map, for each borough (Zona) of São Paulo, to see what we got.
To show each venue in its borough (Zona), let's add a borough id to our dataframe. Later, we use that id to pick a color for each borough.
borogh_list= venues_SP.Borogh.unique()
borogh_ids= [i for i in range (0, len (borogh_list))]
venues_SP['borogh_id']= venues_SP['Borogh'].replace(to_replace=borogh_list, value=borogh_ids, inplace=False)
venues_SP.head ()
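The id assignment above amounts to numbering the borough names in the order they first appear. A minimal sketch in plain Python, using a made-up list of borough names rather than the real dataframe column:

```python
# Made-up sample standing in for the 'Borogh' column
boroughs = ["Centro", "Centro", "Leste", "Norte", "Leste"]

# dict.fromkeys keeps first-seen order, like Series.unique()
ids = {name: i for i, name in enumerate(dict.fromkeys(boroughs))}

# one id per row, usable later as an index into a color list
borough_id = [ids[b] for b in boroughs]

print(ids)
print(borough_id)
```

With a dataframe, the same dictionary could be applied with venues_SP['Borogh'].map(ids), which states the name-to-id mapping explicitly.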
Now we are ready to plot the venues by borough. We just need the coordinates of São Paulo to center the map.
# This is a good point to save our dataframes to a file, in case we need to go back without having to redo all this.
file= 'Battle-of_neighborhoods_w1_venues_SP.csv'
venues_SP.to_csv(file)
print('CSV file {} was saved!'.format (file))
#Save data so we can go back to this at any time.
file= 'Battle-of_neighborhoods_w1_boroghs_sp.csv'
boroghs_sp.to_csv(file)
print('CSV file {} was saved!'.format (file))
# Use this to load our dataframes saved above
###
### YOU CAN SKIP THIS IF YOU ARE RUNNING THIS NOTEBOOK FROM THE BEGINNING
###
file= 'Battle-of_neighborhoods_w1_venues_SP.csv'
# keep_default_na=False keeps the 'NA' placeholders as text instead of turning them into NaN
venues_SP= pd.read_csv(file, keep_default_na=False)
venues_SP.drop ('Unnamed: 0', axis=1, inplace=True)
venues_SP.head ()
## Use this code to import the DataFrame saved above
file= 'Battle-of_neighborhoods_w1_boroghs_sp.csv'
boroghs_sp= pd.read_csv(file)
boroghs_sp.drop ('Unnamed: 0', axis=1, inplace=True)
# we need to redo these as they are used ahead
borogh_list= venues_SP.Borogh.unique()
borogh_ids= [i for i in range (0, len (borogh_list))]
venues_East= venues_SP[venues_SP['Borogh']=='Leste']
venues_East.groupby('Category')['Neighborhood'].count ().sort_values(ascending=False)
#
boroghs_sp.head ()
Get the coordinates of São Paulo city.
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
interest= "São Paulo, SP, Brazil"
geolocator = Nominatim(user_agent="sp_explorer")
saopaulo = geolocator.geocode(interest)
print('The coordinates of São Paulo are {}, {}.'.format(saopaulo.latitude, saopaulo.longitude))
Let's define a function to draw a legend in the top-right corner of the map.
def add_categorical_legend(folium_map, title, colors, labels):
    if len(colors) != len(labels):
        raise ValueError("colors and labels must have the same length.")
    color_by_label = dict(zip(labels, colors))
    legend_categories = ""
    for label, color in color_by_label.items():
        legend_categories += f"<li><span style='background:{color}'></span>{label}</li>"
    legend_html = f"""
    <div id='maplegend' class='maplegend'>
      <div class='legend-title'>{title}</div>
      <div class='legend-scale'>
        <ul class='legend-labels'>
        {legend_categories}
        </ul>
      </div>
    </div>
    """
    # wait until the top-right Leaflet container exists, then append the legend once
    script = f"""
    <script type="text/javascript">
    var oneTimeExecution = (function() {{
        var executed = false;
        return function() {{
            if (!executed) {{
                var checkExist = setInterval(function() {{
                    if ((document.getElementsByClassName('leaflet-top leaflet-right').length) && (!executed)) {{
                        document.getElementsByClassName('leaflet-top leaflet-right')[0].style.display = "flex"
                        document.getElementsByClassName('leaflet-top leaflet-right')[0].style.flexDirection = "column"
                        document.getElementsByClassName('leaflet-top leaflet-right')[0].innerHTML += `{legend_html}`;
                        clearInterval(checkExist);
                        executed = true;
                    }}
                }}, 100);
            }}
        }};
    }})();
    oneTimeExecution()
    </script>
    """
    css = """
    <style type='text/css'>
      .maplegend {
        z-index:9999;
        float:right;
        background-color: rgba(255, 255, 255, 1);
        border-radius: 5px;
        border: 2px solid #bbb;
        padding: 10px;
        font-size:12px;
        position: relative;
      }
      .maplegend .legend-title {
        text-align: left;
        margin-bottom: 5px;
        font-weight: bold;
        font-size: 90%;
      }
      .maplegend .legend-scale ul {
        margin: 0;
        margin-bottom: 5px;
        padding: 0;
        float: left;
        list-style: none;
      }
      .maplegend .legend-scale ul li {
        font-size: 80%;
        list-style: none;
        margin-left: 0;
        line-height: 18px;
        margin-bottom: 2px;
      }
      .maplegend ul.legend-labels li span {
        display: block;
        float: left;
        height: 16px;
        width: 30px;
        margin-right: 5px;
        margin-left: 0;
        border: 0px solid #ccc;
      }
      .maplegend .legend-source {
        font-size: 80%;
        color: #777;
        clear: both;
      }
      .maplegend a {
        color: #777;
      }
    </style>
    """
    folium_map.get_root().header.add_child(folium.Element(script + css))
    return folium_map
import matplotlib.cm as cm
import folium # plotting library
import matplotlib.colors as colors
borogh_names= venues_SP.Borogh.unique()
# n= number of Boroughs
n= len (borogh_ids)
# create map
map_SP = folium.Map(location=[saopaulo.latitude, saopaulo.longitude], zoom_start=11)
title= "Venues by Boroughs in São Paulo"
subtitle= "marker: neighborhood: venue(category)"
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
<h2 align="center" style="font-size:14px"><b>{}</b></h2>
'''.format(title, subtitle)
map_SP.get_root().html.add_child(folium.Element(title_html))
# set color scheme for the clusters
x = np.arange(n)
ys = [i + x + (i*x)**2 for i in range(n)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add venues to the map
# Venue Address PostalCode Neighborhood Category V_Lat V_Long
for name, category, b_id, nbhd, v_lat, v_lon in zip(venues_SP['Venue'], venues_SP['Category'], venues_SP['borogh_id'], venues_SP['Neighborhood'], venues_SP['V_Lat'], venues_SP['V_Long']):
    label = '{}: {}({})'.format(nbhd, name, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [v_lat, v_lon],
        radius=3,
        popup=label,
        color=rainbow[b_id],
        fill=True,
        fill_color=rainbow[b_id],
        fill_opacity=0.6).add_to(map_SP)
# add a legend in the upper right corner
borogh_colors= []
borogh_legends= []
for i in range (n):
    borogh_colors.append (rainbow[i])
    nbhd=borogh_list[i]
    borogh_legends.append (nbhd)
map_SP= add_categorical_legend(map_SP, 'Boroughs',
                               colors = borogh_colors,
                               labels = borogh_legends)
print ("done")
map_SP
Nice! We can clearly see the venues in the North, South, East, West and Center zones of São Paulo.
There is a minor glitch: some venues in the north-east are shown as belonging to the West zone. This is because they are in fact in another municipality (Guarulhos) and were somehow counted as being in São Paulo; that cluster is in the west of Guarulhos. They don't affect our analysis, so we set them aside.
Now that we are confident our data is good, let's move on and focus our analysis on the bakeries in Zona Leste (East) of São Paulo.
Let's filter our dataframe to focus on bakeries in Zona Leste.
I will create two dataframes to work with. The first contains only bakeries and the other contains everything else.
The reason for having a bakery dataset and a not-a-bakery dataset is that I want to map the bakeries alongside the other businesses in the area.
As this kind of business needs a large flow of people nearby, the more other businesses there are around, the better the location for a future bakery.
So our criterion for choosing a place to open a new bakery is: near other businesses and far from other bakeries in the area.
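This notebook applies the criterion visually on a map, but it could also be expressed as two numbers per candidate site: how many other businesses lie within some radius, and how far away the nearest bakery is. The sketch below is only an illustration under assumptions: the 500 m radius, the helper names and all coordinates are made up, and distances use the haversine great-circle formula.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(a))

def score_site(site, bakeries, others, radius_m=500.0):
    """Return (other businesses within radius_m, meters to nearest bakery)."""
    nearby = sum(1 for o in others if haversine_m(*site, *o) <= radius_m)
    nearest = min((haversine_m(*site, *b) for b in bakeries),
                  default=float("inf"))
    return nearby, nearest

# Made-up coordinates in the east of São Paulo
site = (-23.5470, -46.4970)
others = [(-23.5472, -46.4972), (-23.5600, -46.5200)]
bakeries = [(-23.5470, -46.5070)]

nearby, nearest = score_site(site, bakeries, others)
print(nearby, round(nearest))
```

A good candidate under this criterion has a high first number and a large second number; how to weigh one against the other is left open here.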
bakeries_East= venues_East[venues_East.Category=='Bakery']
bakeries_East.shape
# everything in Zona Leste that is not a bakery (na=False treats a missing category as "not a bakery")
notBakery= venues_East[~venues_East.Category.str.contains ('Bakery', na=False)]
notBakery.shape
notBakery.Category.unique ()
len (notBakery.Category.unique ())
This map will show us where the bakeries and the non-bakery venues are in Zona Leste.
Then we can visually look for a place with other businesses around and no bakeries nearby, which is presumably a good place to start a Padaria in Zona Leste of São Paulo.
import matplotlib.cm as cm
import folium # plotting library
import matplotlib.colors as colors
borogh_names= venues_SP.Borogh.unique()
# n= number of business: bakeries/not bakery
n= 2
# East of São Paulo coordinates
leste_latitude= -23.5471308
leste_longitude= -46.4970469
# create bakeries map
# bakeries_East
bakeries_map = folium.Map(location=[leste_latitude, leste_longitude], zoom_start=13, control_scale= True)
title= "Bakeries and not bakeries in East of São Paulo"
subtitle= "marker: neighborhood: venue(category)"
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
<h2 align="center" style="font-size:14px"><b>{}</b></h2>
'''.format(title, subtitle)
bakeries_map.get_root().html.add_child(folium.Element(title_html))
# set color scheme for the clusters
x = np.arange(n)
ys = [i + x + (i*x)**2 for i in range(n)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add bakeries to the map
# Venue Address PostalCode Neighborhood Category V_Lat V_Long
for name, category, nbhd, v_lat, v_lon in zip(bakeries_East['Venue'], bakeries_East['Category'], bakeries_East['Neighborhood'], bakeries_East['V_Lat'], bakeries_East['V_Long']):
    b_id= 0
    label = '{}: {}({})'.format(nbhd, name, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [v_lat, v_lon],
        radius=4,
        popup=label,
        color=rainbow[b_id],
        fill=False,
        fill_color=rainbow[b_id],
        fill_opacity=0.6).add_to(bakeries_map)
# add not-bakery to the map
# Venue Address PostalCode Neighborhood Category V_Lat V_Long
for name, category, nbhd, v_lat, v_lon in zip(notBakery['Venue'], notBakery['Category'], notBakery['Neighborhood'], notBakery['V_Lat'], notBakery['V_Long']):
    b_id= 1
    label = '{}: {}({})'.format(nbhd, name, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [v_lat, v_lon],
        radius=3,
        popup=label,
        color=rainbow[b_id],
        fill=True,
        fill_color=rainbow[b_id],
        fill_opacity=0.6).add_to(bakeries_map)
# add a legend in the upper right corner
bakery_colors= []
bakery_colors.append (rainbow[0])
bakery_colors.append (rainbow[1])
bakery_legends= ['Bakery', 'not a Bakery']
bakeries_map= add_categorical_legend(bakeries_map, 'Business type',
                                     colors = bakery_colors,
                                     labels = bakery_legends)
print ("done")
bakeries_map
This study started with the assumption that the stakeholder wanted a new business in the East zone of São Paulo. However, this code could easily be adapted to run the same analysis for any other borough.
If there were no preferred borough in which to start a Padaria, we could do a quick study of which borough is most suitable, based on the ratio of bakeries to non-bakery venues in each borough, and select the borough with the lowest ratio.
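The ratio idea could be sketched as follows. This was not run as part of this study: the (borough, category) pairs are made up for illustration, and bakery_ratio is a hypothetical helper name.

```python
from collections import Counter

def bakery_ratio(venues):
    """venues: iterable of (borough, category) pairs.
    Returns {borough: bakeries / non-bakery venues}."""
    bakeries, others = Counter(), Counter()
    for borough, category in venues:
        if category == "Bakery":
            bakeries[borough] += 1
        else:
            others[borough] += 1
    # boroughs without any non-bakery venue are left out (no ratio defined)
    return {b: bakeries[b] / others[b] for b in others if others[b] > 0}

# Made-up sample data
sample = [
    ("Leste", "Bakery"), ("Leste", "Pizza Place"), ("Leste", "Gym"),
    ("Leste", "Bar"), ("Leste", "Market"),
    ("Sul", "Bakery"), ("Sul", "Bakery"), ("Sul", "Café"), ("Sul", "Park"),
]

ratios = bakery_ratio(sample)
best = min(ratios, key=ratios.get)  # borough with the lowest ratio
print(ratios, best)
```

With the real data, the pairs could be taken from the 'Borogh' and 'Category' columns of venues_SP.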
And not only bakeries!
We can adapt this study to any other kind of business a stakeholder is willing to start.
As we can see in the map above, there are a few places where a Padaria could be located near other businesses and far from other Padarias. There are even "voids" with no other businesses that could potentially host a Padaria. Although these regions show no other businesses, they are surely populated areas that would welcome a new Padaria in their neighborhood.